Abstract

A pairwise clustering approach is applied to the analysis of the Dow Jones index 
companies, in order to identify similar temporal behavior of the traded stock prices. 
To this end, the chaotic map clustering algorithm is used, where a map is associated 
to each company and the correlation coefficients of the financial time series are 
associated to the coupling strengths between maps. The simulation of a chaotic 
map dynamics gives rise to a natural partition of the data, as companies belonging 
to the same industrial branch are often grouped together. The identification of 
clusters of companies of a given stock market index can be exploited in the portfolio 
optimization strategies. 
1 Introduction



Stock markets have recently attracted growing interest in the physics
community. The aim is to understand the underlying dynamics that rules
the companies' stock prices. In particular, it would be useful to find,
inside a given stock market index, groups of companies sharing similar
temporal behavior. To this purpose, a clustering approach to the
problem may represent a good strategy. Clustering deals with the partitioning
of a set of N elements into K clusters, based on a suitable (and not unique) 
similarity criterion [1]. Non-parametric methods represent the optimal strat- 
egy when a hierarchical structure, rather than a fixed partition, of the data 
should be obtained: this is the case with stock index dynamics and portfolio
optimization strategies [2,3]. Examples of non-parametric methods are the
linkage (agglomerative and divisive) algorithms [4], whose output is a
dendrogram displaying the full hierarchy of the clustering solutions at different
scales. The agglomerative approaches merge, at each step, the two clusters 
with the smallest distance, starting from clusters containing only one element. 
Here we use a non-parametric clustering approach, named chaotic map clus- 
tering (CMC) [5], which relies on the synchronization properties of a chaotic 
map system [6,7] to obtain a hierarchy of classes without any assumptions on 
the underlying structure of the data. 

This paper is organized as follows: in section 2 we give a brief review of the
chaotic map algorithm, suitably modified for pairwise clustering of financial
time series. Section 3 deals with the analysis of the companies' stock prices.
Finally, some conclusions are drawn in section 4.



2 Pairwise chaotic map clustering 

Chaotic map clustering has been introduced as a central algorithm, where
the elements to cluster are embedded in a $D$-dimensional feature space. In
such a picture, the data points are viewed as sites of a lattice hosting a chaotic
map dynamics: the map variables $x_i \in [-1, 1]$, $i = 1, \ldots, N$, are assigned to
the sites of the lattice, and short-range interactions between neighboring maps
are introduced as an exponentially decreasing function of the site distance. In the
stationary regime, clusters of synchronized maps appear, corresponding to
high-density regions in the original data space. The mutual information between
maps is used both as a similarity index for building the clusters and as a scale
parameter for reconstructing the hierarchical tree.

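It should be remarked that a pairwise version of the algorithm can be easily
implemented if an $N \times N$ matrix of similarities (not necessarily distances in
the mathematical sense) is provided instead of the feature vectors for all data.
When clustering temporal patterns $y_i(t)$, the correlation coefficients
$c_{ij} \in [-1, 1]$ are a natural measure of similarity:

$$c_{ij} = \frac{\langle y_i y_j \rangle - \langle y_i \rangle \langle y_j \rangle}{\sqrt{\left( \langle y_i^2 \rangle - \langle y_i \rangle^2 \right) \left( \langle y_j^2 \rangle - \langle y_j \rangle^2 \right)}} , \tag{1}$$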



where the temporal averages are computed over the time series length. In [8], 
the correlation coefficients between financial time series are used as entries into 
the super-paramagnetic clustering (SPC) algorithm [9,10]. The SPC algorithm 
shares the philosophy of the CMC approach, the physical system used
to partition the data being an inhomogeneous ferromagnetic model: Potts
spins $S_i$ are assigned, instead of map variables, to each data point, and
short-range interactions between neighboring sites are introduced. The spin-spin
correlation function replaces the mutual information as the similarity index for
clustering the data. In the super-paramagnetic regime, domains of aligned spins
appear, corresponding to the classes present in the data.

Kullmann et al. generalize the SPC to the case of anti-ferromagnetic couplings
by introducing the following spin-spin coupling strength as a function of the
correlation coefficients [8]:



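$$J_{ij} = \operatorname{sgn}(c_{ij}) \left( 1 - \exp\left[ - \frac{n-1}{n} \left( \frac{c_{ij}}{a} \right)^{n} \right] \right) , \tag{2}$$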



where the sign function sgn maps positive/negative correlations between
companies' stock prices into positive/negative interactions between Potts spins, $n$
is an even positive integer tuning the shape of the interaction function (whose
value should be chosen so that a stable non-trivial partition can be obtained
inside the hierarchical solution), and $a$ is the average of the largest correlation
coefficients for each sequence [8]:

$$a = \frac{1}{N} \sum_{i=1}^{N} \max_{j \neq i} c_{ij} . \tag{3}$$
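As an illustration of the couplings (2) and (3), the following minimal sketch assumes that the coupling has the form $J_{ij} = \operatorname{sgn}(c_{ij}) \{ 1 - \exp[ -\frac{n-1}{n} (c_{ij}/a)^n ] \}$ and that the maximum in (3) runs over $j \neq i$. The function names and the toy correlation matrix are hypothetical and are not taken from [8].

```python
import math

def mean_max_correlation(c):
    """Eq. (3): average over i of the largest off-diagonal c_ij."""
    n = len(c)
    return sum(max(c[i][j] for j in range(n) if j != i)
               for i in range(n)) / n

def coupling(c_ij, a, n):
    """Eq. (2): signed coupling strength; n is an even positive integer."""
    sign = (c_ij > 0) - (c_ij < 0)
    return sign * (1.0 - math.exp(-(n - 1) / n * (c_ij / a) ** n))

# Toy correlation matrix (made-up numbers, not market data).
c = [[1.0, 0.6, -0.1],
     [0.6, 1.0, 0.3],
     [-0.1, 0.3, 1.0]]

a = mean_max_correlation(c)  # (0.6 + 0.6 + 0.3) / 3
J = [[coupling(c[i][j], a, n=4) if i != j else 0.0
      for j in range(3)] for i in range(3)]
```

With these toy numbers, the strongly correlated pair receives a coupling close to one, the weakly correlated pair a small positive coupling, and the anti-correlated pair a small negative one; a larger $n$ makes the transition around $c_{ij} \simeq a$ sharper.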



We shall follow a similar strategy in our CMC approach. We first observe
that, in order to implement a chaotic map dynamics, the correlation
coefficients between financial time series should be mapped into positive
interactions between maps, ranging in $[0, 1]$. Hence, we are naturally led to adopt
the couplings (2) for $c_{ij} > 0$, while setting $J_{ij} = 0$ for $c_{ij} < 0$. In this way,
we build up a partially coupled map lattice with exponentially increasing
interactions between positively correlated companies. In the case of randomly
coupled systems, exact synchronization and the formation of clusters of
identical dynamical states are not found, unlike in the globally coupled case [6]; yet
clusters of almost synchronized maps are still observed, even for a significant
fraction (up to 40-45%) of missing connections [7]. By retaining the
interactions only between positively correlated time series, we bias the formation of
almost synchronized maps to correspond to groups of companies sharing the
same temporal behavior, while anti-correlated companies are likely to belong
to different clusters. The chaotic map dynamics reads

$$x_i(\tau + 1) = \frac{1}{C_i} \sum_{j \neq i} J_{ij} f(x_j(\tau)) , \tag{4}$$

where $f(x) = 1 - 2x^2$ is the logistic map, $C_i = \sum_{j \neq i} J_{ij}$ is a normalization
factor, and $\tau$ denotes the evolution time of the chaotic map system (not to be
confused with the real time $t$ of the financial series). A detailed description
of this dynamics for clustering purposes is given elsewhere [5]; roughly
speaking, after a certain equilibration time, the dynamics
(4) yields a partition of the maps $x_i$ into synchronized clusters that remain
stable during the remaining part of the $\tau$-evolution. Applications of the CMC
algorithm cover a number of fields, such as buried land-mine detection by
dynamic infrared imaging [11,12], the study of human evolution with mitochondrial
DNA sequences [13], and the diagnosis of pathological electroencephalographic
patterns of patients affected by Huntington's disease [14,15].
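The update rule can be sketched in a few lines. The example below is only a minimal illustration, assuming that the dynamics (4) reads $x_i(\tau+1) = C_i^{-1} \sum_{j \neq i} J_{ij} f(x_j(\tau))$ with non-negative couplings and random initial conditions in $[-1, 1]$; the function names, the seed, and the toy coupling matrix are hypothetical, and the implementation of [5] may differ in its details.

```python
import random

def logistic(x):
    """f(x) = 1 - 2 x^2, which maps [-1, 1] onto itself."""
    return 1.0 - 2.0 * x * x

def iterate_maps(J, steps, seed=0):
    """Iterate Eq. (4) and return the list of map states at each time.

    J is a symmetric matrix of non-negative couplings with zero diagonal;
    every row is assumed to contain at least one positive entry.
    """
    rng = random.Random(seed)
    n = len(J)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Normalization factors C_i = sum over j != i of J_ij.
    norm = [sum(J[i][j] for j in range(n) if j != i) for i in range(n)]
    trajectory = [x]
    for _ in range(steps):
        fx = [logistic(v) for v in x]
        x = [sum(J[i][j] * fx[j] for j in range(n) if j != i) / norm[i]
             for i in range(n)]
        trajectory.append(x)
    return trajectory

# Toy couplings: maps 0 and 1 are strongly coupled, map 2 only weakly.
J = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
traj = iterate_maps(J, steps=200)
```

Since each new value is a weighted average of $f(x_j) \in [-1, 1]$, the map variables never leave the interval $[-1, 1]$. In practice an initial transient would be discarded before the trajectories are analyzed.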



3 Application to financial time series 



Here we apply the CMC algorithm to cluster the companies of the Dow Jones
(DJ) market index, which includes $N = 30$ stocks, whose names are listed in
Appendix A together with the identifying tickers and the related industrial
branches. We first analyze one-year time periods, from 1998 to 2002. For each
year, the correlation coefficients (1) are computed for the time series of
logarithmic daily price variations

$$y_i(t) = \ln P_i(t + 1) - \ln P_i(t) , \tag{5}$$

where $P_i(t)$ is the closing price of stock $i$ on day $t$.
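The preprocessing step can be illustrated with a short sketch. It assumes that the correlation coefficient (1) is the standard Pearson coefficient, $c_{ij} = (\langle y_i y_j \rangle - \langle y_i \rangle \langle y_j \rangle) / \sqrt{(\langle y_i^2 \rangle - \langle y_i \rangle^2)(\langle y_j^2 \rangle - \langle y_j \rangle^2)}$, with averages taken over the series length. The function names and the toy prices are hypothetical; no market data are used.

```python
import math

def log_returns(prices):
    """Eq. (5): y(t) = ln P(t+1) - ln P(t) for a list of closing prices."""
    return [math.log(prices[t + 1]) - math.log(prices[t])
            for t in range(len(prices) - 1)]

def correlation(y_i, y_j):
    """Pearson correlation, with temporal averages over the series length."""
    T = len(y_i)
    mean_i = sum(y_i) / T
    mean_j = sum(y_j) / T
    cov = sum(a * b for a, b in zip(y_i, y_j)) / T - mean_i * mean_j
    var_i = sum(a * a for a in y_i) / T - mean_i ** 2
    var_j = sum(b * b for b in y_j) / T - mean_j ** 2
    return cov / math.sqrt(var_i * var_j)

def correlation_matrix(price_series):
    """Matrix of c_ij for a list of equally long price series."""
    returns = [log_returns(p) for p in price_series]
    n = len(returns)
    return [[1.0 if i == j else correlation(returns[i], returns[j])
             for j in range(n)] for i in range(n)]

# Toy closing prices (made-up numbers).
prices = [
    [100.0, 102.0, 101.0, 105.0, 107.0],
    [50.0, 51.0, 50.5, 52.5, 53.5],   # proportional to the first series
    [80.0, 79.0, 80.0, 78.0, 77.5],   # tends to move against it
]
c = correlation_matrix(prices)
```

The second toy series is a rescaled copy of the first, so their log-returns coincide and the correlation is one up to rounding; the third series yields a negative coefficient with the first.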

It should be remarked that, for each investigated period, the number of pairs
of anti-correlated companies $N_{c<0}$ is very small in comparison with the total
number of pairs $N(N-1)/2 = 435$, and the mean value of the anticorrelations
$\langle c \rangle_{c<0}$ is almost zero (see table 1). It should be stressed that the
very fact that nearly all stocks are positively correlated, with practically no
anticorrelation, makes any clustering procedure difficult.
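The two quantities discussed above can be obtained from a correlation matrix as in the following minimal sketch; the function name and the toy matrix are hypothetical.

```python
def anticorrelation_summary(c):
    """Return (number of pairs with c_ij < 0, total pairs, mean negative c_ij)."""
    n = len(c)
    negatives = [c[i][j] for i in range(n) for j in range(i + 1, n)
                 if c[i][j] < 0]
    total_pairs = n * (n - 1) // 2
    # The mean is reported as 0.0 when no anti-correlated pair exists.
    mean_neg = sum(negatives) / len(negatives) if negatives else 0.0
    return len(negatives), total_pairs, mean_neg

# Toy correlation matrix (made-up numbers, not market data).
c = [[1.0, 0.6, -0.1],
     [0.6, 1.0, 0.3],
     [-0.1, 0.3, 1.0]]
n_neg, n_pairs, mean_neg = anticorrelation_summary(c)

# For the N = 30 stocks of the index the number of distinct pairs is 435.
total_pairs_dj = 30 * 29 // 2
```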

As a result of the processing, a dendrogram displays the hierarchical structure
of the clusters at different values of the mutual information $I_{ij}$, defined as
follows:

• extract a binary sequence $S_i$ from each map $x_i(\tau)$, such that

$$S_i(\tau) = \begin{cases} 1, & \text{if } x_i(\tau) > 0; \\ 0, & \text{otherwise}; \end{cases} \tag{6}$$

• evaluate the probability $P(S_i)$ as the number of times the state $S_i$ occurs
along the sequence of map $i$, normalized to the sequence length; in a similar way,
$P(S_i, S_j)$ is the frequency of simultaneous occurrence of the states $(S_i, S_j)$
along the sequences of maps $i$ and $j$;

• compute the string entropy $H_i$ and the joint entropy $H_{ij}$ as

$$H_i = - \sum_{S_i = 0,1} P(S_i) \ln P(S_i) , \tag{7}$$

$$H_{ij} = - \sum_{S_i = 0,1} \sum_{S_j = 0,1} P(S_i, S_j) \ln P(S_i, S_j) ; \tag{8}$$

• the mutual information is then given by $I_{ij} = H_i + H_j - H_{ij}$.
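The steps listed above can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming the mutual information is $I_{ij} = H_i + H_j - H_{ij}$ with natural logarithms and the convention $0 \ln 0 = 0$.

```python
import math

def binarize(trajectory):
    """Eq. (6): 1 where the map variable is positive, 0 otherwise."""
    return [1 if x > 0 else 0 for x in trajectory]

def entropy(probabilities):
    """Shannon entropy in nats; zero-probability states contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def mutual_information(s_i, s_j):
    """I_ij = H_i + H_j - H_ij for two equally long binary sequences."""
    T = len(s_i)
    p_i = [s_i.count(v) / T for v in (0, 1)]
    p_j = [s_j.count(v) / T for v in (0, 1)]
    joint = [sum(1 for a, b in zip(s_i, s_j) if (a, b) == (u, v)) / T
             for u in (0, 1) for v in (0, 1)]
    return entropy(p_i) + entropy(p_j) - entropy(joint)

bits = binarize([0.3, -0.2, 0.0, 0.9])

# Identical balanced sequences give ln 2; independent ones give 0.
mi_sync = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
mi_indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The value $\ln 2$ is reached only when the synchronized sequences visit both states equally often; in an application the sequences would be the binarized map trajectories, taken after the equilibration time.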

The mutual information is a measure of the correlations between maps [17],
ranging between $I_{ij} = 0$, for maps evolving independently, and $I_{ij} = \ln 2$, for
exactly synchronized maps. For this reason, $I_{ij}$ can be appropriately adopted
as a similarity index for clustering the companies: by cutting the dendrogram
at a certain level $I \in [0, \ln 2]$, the clusters thus obtained are made up of
companies whose associated maps are characterized by $I_{ij} > I$. The level $I$ can
be suitably chosen by relying on a stability criterion for the clustering
solution. To this purpose, the cluster entropy $S(I)$ [6] can be used to select
the most stable partition among the whole hierarchy yielded by the algorithm,
by looking for a plateau over the widest possible range of $I$ values:

$$S(I) = - \sum_{k=1}^{N_I} P_I(k) \ln P_I(k) , \tag{9}$$

where $P_I(k)$ is the fraction of elements belonging to cluster $k$, and $N_I$ is the
number of clusters found at level $I$.
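A minimal sketch of the cut and of the cluster entropy is given below. It assumes that the clusters at level $I$ are the connected components of the graph in which two companies are linked whenever $I_{ij} > I$ (a single-linkage reading of the criterion above, which may differ from the actual implementation), and that $S(I) = -\sum_k P_I(k) \ln P_I(k)$. The function names and the toy matrix are hypothetical.

```python
import math

def clusters_at_level(mi, level):
    """Connected components of the graph with links where mi[i][j] > level."""
    n = len(mi)
    label = [-1] * n
    clusters = []
    for start in range(n):
        if label[start] != -1:
            continue
        label[start] = len(clusters)
        members, stack = [start], [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if label[j] == -1 and mi[i][j] > level:
                    label[j] = label[start]
                    members.append(j)
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters

def cluster_entropy(clusters):
    """Eq. (9): entropy of the cluster-size fractions P_I(k)."""
    n = sum(len(c) for c in clusters)
    return -sum(len(c) / n * math.log(len(c) / n) for c in clusters)

# Toy mutual-information matrix for four elements (made-up numbers).
mi = [[0.0, 0.6, 0.1, 0.1],
      [0.6, 0.0, 0.1, 0.1],
      [0.1, 0.1, 0.0, 0.5],
      [0.1, 0.1, 0.5, 0.0]]

low = clusters_at_level(mi, 0.05)    # everything linked
mid = clusters_at_level(mi, 0.30)    # two pairs
high = clusters_at_level(mi, 0.65)   # four singletons
```

For this toy matrix the entropy grows from $0$ (one cluster) to $\ln 2$ (two equal clusters) to $\ln 4$ (four singletons) as the level is raised.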

This model depends on one parameter, the positive even integer $n$,
which tunes the range of the interactions (2). For each period, the optimal
value of $n$ is chosen according to the stability criterion
of the entropy (9) at different cluster partitions. As an example, we consider
the year 1999: figure 1 displays the entropy $S$ in
the plane spanned by $I$ (mutual information) and $n$, with $n = 2, 4, 6, \ldots, 24$.
We choose $n = 8$ as the optimal value, since it yields the widest range of
constant values of $S$ along the $I$-direction ($0.4 < I < 0.6$).

Once this parameter has been fixed, the full hierarchy of clusters can be
displayed as a dendrogram: figure 2 shows the result obtained for the year
1999. The dendrogram has been cut in the region of stable partitions at $I \simeq 0.6$.
For low values of the mutual information, all companies are linked
together in one single cluster, which splits into two big clusters at $I = 0.16$:
on one side, we clearly recognize companies dealing mainly with capital goods
(BA, CAT, HON) and basic materials (AA, DD, IP). On the other side, we find
a cluster of strongly correlated companies, represented by the branch marked
by a star. This cluster, which gradually breaks up as the mutual information
approaches its maximum value $I = \ln 2$, groups together different industrial
branches: financial (C, AXP, JPM), services (DIS, HD, MCD, SBC, T, WMT),
healthcare (JNJ, MRK), conglomerates (GE, UTX), consumer non-cyclical
(GM, KO, MO, PG). Besides this cluster, the formation
of technological cores (IBM and HPQ, INTC and MSFT) should be noted.
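The plateau criterion used to fix $n$ and the cut level can be illustrated with a short sketch. It assumes that the cluster entropy has been sampled on an increasing grid of levels $I$ and looks for the widest interval over which it stays constant within a tolerance; the function name, the tolerance, and the sample values are hypothetical, and trivial plateaus (a single cluster, or all singletons) are not excluded here.

```python
def widest_plateau(levels, entropies, tol=1e-9):
    """Return (first level, last level) of the widest run of constant entropy."""
    best = (0, 0)
    start = 0
    for k in range(1, len(levels) + 1):
        # Close the current run at the end of the grid or when S changes.
        if k == len(levels) or abs(entropies[k] - entropies[start]) > tol:
            width = levels[k - 1] - levels[start]
            if width > levels[best[1]] - levels[best[0]]:
                best = (start, k - 1)
            start = k
    return levels[best[0]], levels[best[1]]

# Made-up entropy profile for one value of n.
levels = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
entropies = [0.0, 0.3, 0.3, 0.5, 0.9, 0.9, 0.9, 1.2]
plateau = widest_plateau(levels, entropies)
```

Repeating the search for each value of $n$ and keeping the one with the widest plateau mirrors the selection described above.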
This analysis has been carried out for each of the 5 years considered
(1998-2002). In the following, we report the main clusters found for the different
years, together with the values chosen for the parameter $n$ and the values of the
mutual information at which the dendrogram has been cut. Sub-clusters of
companies belonging to the same industrial branch have been underbraced:

• Year 1998, $n = 16$, $I = 0.62$

(1) DIS MCD T WMT KO MO PG JNJ MRK

(2) AXP C JPM GM

• Year 1999, $n = 8$, $I = 0.24$

(1) DIS HD MCD SBC T WMT KO MO PG AXP C JPM JNJ MRK INTC MSFT GE UTX GM

(2) BA CAT HON DD IP EK XOM

• Year 2000, $n = 18$, $I = 0.26$

(1) BA CAT HON AA DD IP KO PG MMM UTX EK MCD

(2) AXP C JPM SBC T GE GM

• Year 2001, $n = 20$, $I = 0.15$

(1) DIS HD MCD SBC T WMT BA CAT HON AXP C JPM AA DD IP GE MMM UTX EK GM MO XOM

(2) HPQ IBM INTC MSFT

• Year 2002, $n = 16$, $I = 0.62$

(1) AA DD IP CAT HON MMM UTX GM MCD XOM

(2) AXP C JPM DIS SBC EK GE MRK

(3) HPQ IBM MSFT HD

It is worth stressing the presence of some cores of companies which remain 
strongly linked together over periods longer than one year: financial companies 
(AXP, C, JPM, 98-02), services (DIS, MCD, T, WMT, 98-99, 01), consumer 
non-cyclical (KO, MO, PG, 98-99), basic materials (AA, DD, IP, 00-02), cap- 
ital goods (BA, CAT, HON, 99-01), technology (HPQ, IBM, MSFT, 01-02), 
healthcare (JNJ, MRK, 98-99), conglomerates (MMM, UTX, 00-02). 
Once a partition of companies has been obtained, an efficient portfolio could be
made of one "representative" stock per cluster, thus ensuring a diversification
of the investment. The choice of the period length for computing the
correlation coefficients should be related to the flexibility of the portfolio. From this
point of view, an analysis covering the whole 5-year period is based
on more stable correlation coefficients, thus leading to more stable partitions
(i.e., less hazardous investments), at the cost of a less flexible portfolio. In
figure 3, we report the full hierarchy of clusters found for the whole 5-year
time period ($n = 18$). We remark that no anticorrelations have
been found for this period. The main branches of the dendrogram have been
marked by the industrial areas of the companies they are made of.



4 Conclusions 



In the present work, a pairwise version of the chaotic map algorithm has been 
applied to the analysis of the companies' stocks belonging to the Dow Jones 
market index. The correlation coefficients between financial time series have 
been used as similarity measures to cluster the temporal patterns. Once the 
coupling interactions between maps are taken to be functions of this feature, 
the dynamics of such a system leads to the formation of clusters of companies 
that can often be identified as different industrial branches. The clustering 
output can be exploited to optimize the portfolio composition. 
A Dow Jones stock market companies



AA: Alcoa Inc. - Basic Materials 

AXP: American Express Co. - Financial 

BA: Boeing - Capital Goods 

C: Citigroup - Financial 

CAT: Caterpillar - Capital Goods 

DD: DuPont - Basic Materials 

DIS: Walt Disney - Services 

EK: Eastman Kodak - Consumer Cyclical 

GE: General Electric - Conglomerates

GM: General Motors - Consumer Cyclical 

HD: Home Depot - Services 

HON: Honeywell International - Capital Goods 

HPQ: Hewlett-Packard - Technology 

IBM: International Business Machines - Technology

INTC: Intel Corporation - Technology 

IP: International Paper - Basic Materials 

JNJ: Johnson & Johnson - Healthcare 

JPM: JP Morgan Chase - Financial 

KO: Coca Cola Inc. - Consumer Non-Cyclical 

MCD: McDonalds Corp. - Services 

MMM: Minnesota Mining - Conglomerates 

MO: Philip Morris - Consumer Non-Cyclical 

MRK: Merck & Co. - Healthcare 

MSFT: Microsoft - Technology 

PG: Procter & Gamble - Consumer Non-Cyclical 

SBC: SBC Communications - Services 

T: AT&T - Services

UTX: United Technologies - Conglomerates

WMT: Wal-Mart Stores - Services 

XOM: Exxon Mobil - Energy 
Year                        | 1998 | 1999    | 2000    | 2001    | 2002
$N_{c<0}$                   |      | 25      | 34      | 11      | 1
$\langle c \rangle_{c<0}$   |      | -0.0453 | -0.0494 | -0.0495 | -0.0071

Table 1

Number of pairs of anti-correlated stocks $N_{c<0}$ and mean value of the anticorrelation
$\langle c \rangle_{c<0}$. $N_{c<0}$ and $\langle c \rangle_{c<0}$ must be compared with the total number of pairs
$N(N-1)/2 = 435$ and with the mean correlation $\langle c \rangle \simeq 0.28$, respectively.




Fig. 1. Cluster entropy $S$ in the plane spanned by the mutual information $I$ and
the parameter $n$. The widest $S$-plateau along the $I$-direction (namely, the range of
values of $I$ for which $S$ is constant) is $0.4 < I < 0.6$ and corresponds to $n = 8$. This
analysis refers to the year 1999.
[Figure 2: dendrogram. Branch labels: "capital goods, basic materials" and
"financial, services, healthcare, technology, conglomerates, consumer non-cyclical".
Leaf labels: AA MMM BA EK CAT HON DD IP XOM; MSFT INTC MO UTX HPQ IBM.]

Fig. 2. Dendrogram obtained for the year 1999 ($n = 8$), cut in the region of stable
partitions at $I \simeq 0.6$. The branch marked by a star (not explicitly shown) groups
together different industrial sub-classes: financial (C, AXP, JPM), services (DIS,
HD, MCD, SBC, T, WMT), healthcare (JNJ, MRK), conglomerates (GE, UTX),
consumer non-cyclical (GM, KO, MO, PG).



11 



0.2 - 



0.5 - 



(A 
(D 

o 

(D 
(A 



3 

wrro 

2. is 
5' o 

(A 



(D 

o 

3 

o 
o 

<Q 
>< 



m 
fi) 



AXPCAJPMGEGMSBCDISHDWMTWCDT UTXAA DD IP C MM XO EK BA HONHPMSFIBM INTKO MRKPGJNJ MO 



Fig. 3. Dendrogram found from the whole 5-yeax time period 1998-2002, with n = 18. 
The main branches have been marked by the industrial areas of the companies they 
are made of. 
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